New York is a fascinating city that draws visitors year-round, making tourism a thriving business. Manhattan remains the center of attention for visitors, yet the cost of hotels in Manhattan is quite high.
Brooklyn has become trendy for its arts and music scene, shaped by the influence of different cultures, especially its Black communities, though some neighborhoods are experiencing gentrification as new investment in infrastructure arrives. With its proximity to Manhattan and access to the subway, a cheap means of transport, it is an ideal place to open a trendy Bed and Breakfast: turning, for example, a four-story brownstone into a good business for locals and families instead of rentals.
The idea is to provide a potential investor with the best possible information to decide, based on these requirements:
Within a 1.6 km radius of the Barclays Center
Not too crowded with B&Bs or similar businesses
With a variety of venues such as cafés, restaurants, and pubs.
Non-requirements that will nonetheless add value to the exercise:
Make a competitive analysis: if possible, get a description of the competitors' buildings in square meters / square feet.
Check the area under study for properties of similar size and their prices.
Select a few properties and check their immediate neighborhoods for venues and amenities.
Make a recommendation.
We will use the skills and tools learned in the IBM-Coursera Specialization to fulfil these requirements.
With the problem in hand, we need the following information:
Identify the B&Bs and similar businesses within a 1.6 km radius of the Barclays Center.
✔ This fulfils the 1st and 2nd requirements
✔ This puts neighborhoods in "competition" for the investment, fulfilling the 3rd requirement
Additional information:
✔ Gather address, square meters, features, price, and Realtor contact.
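The 1.6 km radius requirement can also be checked independently of the distance field returned by Foursquare, using a haversine calculation. A minimal sketch (the coordinates below are illustrative, not taken from the data):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

# approximate Barclays Center coordinates vs. a point roughly 1 km due north
d = haversine_m(40.6826, -73.9754, 40.6916, -73.9754)
print(f"{d:.0f} m from the Barclays Center; within 1.6 km: {d <= 1600}")
```

This can be applied row by row to any dataframe of venue coordinates to sanity-check the radius filter.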
Sources and tools:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from lxml import html
import unicodecsv as csv
import json # library to handle JSON files
import io as io
import time
##!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
# libraries for displaying images
from IPython.display import Image
from IPython.core.display import HTML
# JSON results are transformed into dataframes with pd.json_normalize (the old pandas.io.json import is deprecated and no longer needed)
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import pylab as pl
# import k-means from clustering stage
from sklearn.cluster import KMeans
print('Folium installed')
print('Libraries imported.')
## My foursquare credentials :
CLIENT_ID = 'I41GAWTM30NPWPJD1DPVQU15HXPQODVTUFS4YAO1TCUBBBDO' # your Foursquare ID
CLIENT_SECRET = 'UJUYKYPXU0QU4NOHOZSI45U1CG1TUZBDH1Q3JOLSGMNSCYLF' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
## Get the geolocation of the Barclays Center address.
address = "Barclays Center, Brooklyn, NY"
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('\033[94m'+'Barclays Center, Brooklyn, NY latitude=',latitude, 'longitude =', longitude )
https://api.foursquare.com/v2/venues/search?client_id=CLIENT_ID&client_secret=CLIENT_SECRET&ll=LATITUDE,LONGITUDE&v=VERSION&query=QUERY&radius=RADIUS&limit=LIMIT
## radius of 1600 meters and limit =100 (we want all 😁 )
search_query = 'Bed and Breakfast'
radius = 1600
LIMIT = 100
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url
# Storing result in the results_bnb variable; we will do the same for hostels
results_bnb = requests.get(url).json()
#commented to make notebook readable
#results_bnb
## radius of 1600 meters and limit =100 (we want all 😁 )
search_query = 'hostel'
radius = 1600
LIMIT = 100
print(search_query + ' .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url
# Storing result in host variable
results_host = requests.get(url).json()
#commented to make notebook readable
#results_host
## radius of 1600 meters and limit = 500 (we want all 😁 )
#search_query = 'cafe'
radius = 1600
LIMIT = 500
print('All .... OK!')
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url
# Storing result in the results_venues variable
results_venues = requests.get(url).json()
#commented to make notebook readable
#results_venues
# assign relevant part of JSON to venues
venues_bnb = results_bnb['response']['venues']
venues_host = results_host['response']['venues']
results_venues = results_venues['response']['venues']
# tranform venues into a dataframe
dataframe_bnb = pd.json_normalize(venues_bnb)
dataframe_host = pd.json_normalize(venues_host)
dataframe_venues = pd.json_normalize(results_venues)
# Checking the dataframe information as a good practice
#commented to make notebook readable
#dataframe_bnb.info()
# Checking the dataframe information as a good practice
#commented to make notebook readable
#dataframe_host.info()
# Checking the dataframe information as a good practice
#commented to make notebook readable
#dataframe_venues.info()
## merging the 3 dataframes
dataframe_bnb_venues = pd.concat([dataframe_bnb, dataframe_host, dataframe_venues],ignore_index=True)
dataframe_bnb_venues.head()
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe_bnb_venues.columns if col.startswith('location.')] + ['id']
dataframe_filtered_bnb_venues = dataframe_bnb_venues.loc[:, filtered_columns]
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
# filter the category for each row
dataframe_filtered_bnb_venues['categories'] = dataframe_filtered_bnb_venues.apply(get_category_type, axis=1)
# clean column names by keeping only last term
dataframe_filtered_bnb_venues.columns = [column.split('.')[-1] for column in dataframe_filtered_bnb_venues.columns]
#checking that all venues were added, also used tails to reduce print out
dataframe_filtered_bnb_venues.tail()
#adding neighbourhood and reordering columns
dataframe_filtered_bnb_venues['neighbourhood'] = np.nan
dataframe_filtered_bnb_venues=dataframe_filtered_bnb_venues[['name','categories','address','neighbourhood','distance','lat','lng','postalCode','cc','city','state','country','formattedAddress','labeledLatLngs','crossStreet','id']]
# reduce it to print out to just 10 items remove head to see all 138 venues
dataframe_filtered_bnb_venues.name.head(10)
##Checking dataset for NA values to fix
## reducing print out
dataframe_filtered_bnb_venues['address'].notna().head()
### filter out businesses and bus stops that are not part of what we need to analyze
categ_filter = ['Hostel', 'Bed & Breakfast']
dataframe_filtered_bnb = dataframe_filtered_bnb_venues[dataframe_filtered_bnb_venues.categories.isin(categ_filter)]
dataframe_filtered_bnb = dataframe_filtered_bnb.reset_index(drop=True)
dataframe_filtered_bnb
## Reverse geocoding: this function takes latitude and longitude and returns an approximate address
def getaddress(latmiss, longmiss):
    geolocator = Nominatim(user_agent="brooklyn_explorer")
    geo_string = str(latmiss) + ', ' + str(longmiss)
    loc_found = geolocator.reverse(geo_string)
    #print(loc_found.raw)
    addr_found = loc_found.raw['address']
    # .get() with a default avoids a KeyError when Nominatim omits a field
    city_f = str(addr_found.get('city', ''))
    postalCode_f = str(addr_found.get('postcode', ''))
    state_f = str(addr_found.get('state', ''))
    neighbourhood_f = str(addr_found.get('neighbourhood', ''))
    road_f = str(addr_found.get('road', ''))
    house_number_f = str(addr_found.get('house_number', ''))
    address_f = house_number_f + ' ' + road_f
    return (address_f, city_f, state_f, postalCode_f, neighbourhood_f)
## Check for invalid addresses, with debug prints; it could be done better, but for now it works
print('\033[1m'+ 'Validating address fields in dataframe; if the value is null we perform reverse geocoding using latitude and longitude data' +'\033[0m')
for index, frame in dataframe_filtered_bnb['neighbourhood'].items():
    if pd.notnull(frame):
        print("Valid address at index:", index)
    else:
        lat_miss = dataframe_filtered_bnb.lat[index]
        long_miss = dataframe_filtered_bnb.lng[index]
        business_name = dataframe_filtered_bnb.name[index]
        address_to_add, city_to_add, state_to_add, postalCode_to_add, neighbourhood_to_add = getaddress(lat_miss, long_miss)
        print('\033[91m'+'Invalid/Null address on:\n\tIndex:', index, "\n\tBusiness:", business_name, "\n\tLat & Lng info: ", lat_miss, long_miss)
        print('\033[0m'+'\t\tAddress found: ', address_to_add, city_to_add, state_to_add, postalCode_to_add)
        print('\t\t\tAdding to dataframe')
        dataframe_filtered_bnb.loc[index, 'address'] = address_to_add
        dataframe_filtered_bnb.loc[index, 'city'] = city_to_add
        dataframe_filtered_bnb.loc[index, 'state'] = state_to_add
        dataframe_filtered_bnb.loc[index, 'postalCode'] = postalCode_to_add
        dataframe_filtered_bnb.loc[index, 'neighbourhood'] = neighbourhood_to_add
#print(dataframe_filtered_bnb[index])
dataframe_filtered_bnb
#unique neighbourhoods and count of BnB on each.
dataframe_filtered_bnb['neighbourhood'].value_counts()
###visualize them
venues_map = folium.Map(location=[latitude, longitude], zoom_start=13) # generate map centred around the Barclay Center
# add a red circle marker to represent the Barclay Center
folium.CircleMarker(
[latitude, longitude],
radius=10,
color='red',
popup='Barclays Center',
fill = True,
fill_color = 'red',
fill_opacity = 0.6
).add_to(venues_map)
# add the Bed & Breakfast as blue circle markers
for lat, lng, label in zip(dataframe_filtered_bnb.lat, dataframe_filtered_bnb.lng, dataframe_filtered_bnb.name):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)
# display map
venues_map
## get the list of neighbourhoods of interest close to the Barclays Center
## preprocessing was already done; loading the data from a CSV
brook_neigh = pd.read_csv(r'C:\Users\rcuberob\Downloads\neighborhoods_brooklyn_2.csv')
brook_neigh.tail(5)
##filter unique neighbourhoods
#keep first duplicate value
brook_neigh = brook_neigh.drop_duplicates(subset=['neighbourhood']).reset_index(drop=True)
brook_neigh.head()
## adding Latitude and Longitude columns to the neighbourhood dataframe
brook_neigh["Lat"] = ""
brook_neigh["Long"] = ""
brook_neigh
## function to get latitude & longitude from a conventional address
def getlatlong(address_to_find):
    geolocator = Nominatim(user_agent="foursquare_agent")
    #print(address_to_find)
    location2 = geolocator.geocode(str(address_to_find))
    #print(location2)
    latitude2 = location2.latitude
    longitude2 = location2.longitude
    return (latitude2, longitude2)
## get the latitude and longitude of the addresses above
for index, frame in brook_neigh['House_number_name'].items():
    house = brook_neigh.House_number_name[index]
    road = brook_neigh.road[index]
    neigh = brook_neigh.neighbourhood[index]
    sub = brook_neigh.suburb[index]
    city = brook_neigh.city[index]
    # join the address parts with separators so the geocoder can parse them
    add = ', '.join([str(house), str(road), str(neigh), str(sub), str(city)])
    lat2, long2 = getlatlong(add)
    print(add, (lat2, long2))
    brook_neigh.loc[index, 'Lat'] = lat2
    brook_neigh.loc[index, 'Long'] = long2
# Brooklyn neighbourhoods of interest with geo-coordinates
brook_neigh
## Function that gets the nearby venues for the neighbourhoods of interest.
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['neighbourhood',
                             'neighbourhood Latitude',
                             'neighbourhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
## Neighbourhoods of interest
brook_venues = getNearbyVenues(names=brook_neigh['neighbourhood'],
latitudes=brook_neigh['Lat'],
longitudes=brook_neigh['Long']
)
# Debug print to check if everything worked as expected
print(brook_venues.shape)
brook_venues.head()
brook_venues.groupby('neighbourhood').count()
print('There are {} uniques categories.'.format(len(brook_venues['Venue Category'].unique())))
#One Hot Encoding
brook_onehot = pd.get_dummies(brook_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
brook_onehot['neighbourhood'] = brook_venues['neighbourhood']
# move neighborhood column to the first column
fixed_columns = [brook_onehot.columns[-1]] + list(brook_onehot.columns[:-1])
brook_onehot = brook_onehot[fixed_columns]
brook_onehot.head()
brook_onehot.shape
brook_grouped = brook_onehot.groupby('neighbourhood').mean().reset_index()
brook_grouped
brook_grouped.shape
#top 5 venues per neighbourhood
num_top_venues = 5
for hood in brook_grouped['neighbourhood']:
    print("----" + hood + "----")
    temp = brook_grouped[brook_grouped['neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
## top 10 venues per neighbourhood
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind + 1))
# create a new dataframe
neighbourhood_venues_sorted = pd.DataFrame(columns=columns)
neighbourhood_venues_sorted['neighbourhood'] = brook_grouped['neighbourhood']
for ind in np.arange(brook_grouped.shape[0]):
    neighbourhood_venues_sorted.iloc[ind, 1:] = return_most_common_venues(brook_grouped.iloc[ind, :], num_top_venues)
neighbourhood_venues_sorted
## Recap on BnB on those neighbourhoods
#unique neighbourhoods and count of BnB on each.
dataframe_filtered_bnb['neighbourhood'].value_counts()
brook_neigh
# K-Means: grouping the neighbourhoods into 3 clusters
kclusters = 3
brook_grouped_clustering = brook_grouped.drop('neighbourhood', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(brook_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:5]
# add clustering labels
neighbourhood_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
brook_merged = brook_neigh
# merge brook_grouped with brook_neigh to add latitude/longitude for each neighbourhood
brook_merged = brook_merged.join(neighbourhood_venues_sorted.set_index('neighbourhood'), on='neighbourhood')
brook_merged
## dropping some columns we do not need for now
col_2_drop = ['House_number_name', 'road', 'county', 'city', 'state', 'country']
brook_merged_short = brook_merged.drop(col_2_drop, axis=1)
brook_merged_short
##create map with clusters
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(brook_merged_short['Lat'], brook_merged_short['Long'], brook_merged_short['neighbourhood'], brook_merged_short['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=15,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
brook_merged_short.loc[brook_merged_short['Cluster Labels'] == 0, brook_merged_short.columns[[0] + list(range(5, brook_merged_short.shape[1]))]]
brook_merged_short.loc[brook_merged_short['Cluster Labels'] == 1, brook_merged_short.columns[[0] + list(range(5, brook_merged_short.shape[1]))]]
brook_merged_short.loc[brook_merged_short['Cluster Labels'] == 2, brook_merged_short.columns[[0] + list(range(5, brook_merged_short.shape[1]))]]
##create map with clusters
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(brook_merged_short['Lat'], brook_merged_short['Long'], brook_merged_short['neighbourhood'], brook_merged_short['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=15,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
##### Adding the Barclays Center and BnB locations
folium.CircleMarker(
[latitude, longitude],
radius=10,
color='Black',
popup='Barclays Center',
fill = True,
fill_color = 'Black',
fill_opacity = 0.6
).add_to(map_clusters)
# add the Bed & Breakfast as blue circle markers
for lat, lng, label in zip(dataframe_filtered_bnb.lat, dataframe_filtered_bnb.lng, dataframe_filtered_bnb.name):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_clusters)
map_clusters
brook_merged_short[['neighbourhood', 'postcode', 'Cluster Labels']]
dataframe_filtered_bnb[['name','categories','neighbourhood','postalCode']]
dataframe_filtered_bnb['postalCode'].value_counts()
#reading Zillow information on zip code 11205 vs 11217
brook_11205_zillow = pd.read_csv(r'C:\Users\rcuberob\Downloads\properties_11205_zillow.csv')
brook_11217_zillow = pd.read_csv(r'C:\Users\rcuberob\Downloads\properties_11217_zillow.csv')
brook_11205_zillow.head()
brook_11217_zillow.head()
feat_data_11205 = brook_11205_zillow[['type_of_property','price','beds','bath','area_sqft']]
feat_data_11205.hist()
plt.show()
feat_data_11217 = brook_11217_zillow[['price','beds','bath','area_sqft']]
feat_data_11217.hist(color="green")
plt.show()
# this shows a few outliers in property size; removing those properties gives a better picture
brook_11205_zillow['area_sqft'].max()
# this shows a few outliers in property size; removing those properties gives a better picture
brook_11217_zillow['area_sqft'].max()
brook_11205_zillow.drop(brook_11205_zillow[brook_11205_zillow['area_sqft']== 59800].index, inplace=True)
brook_11205_zillow['area_sqft'].max()
brook_11217_zillow.drop(brook_11217_zillow[brook_11217_zillow['area_sqft']== 13640].index, inplace=True)
brook_11217_zillow['area_sqft'].max()
feat_data_11205 = brook_11205_zillow[['type_of_property','price','beds','bath','area_sqft']]
feat_data_11205.hist()
plt.show()
feat_data_11217 = brook_11217_zillow[['type_of_property','price','beds','bath','area_sqft']]
feat_data_11217.hist(color = "green")
plt.show()
plt.scatter(feat_data_11205.price, feat_data_11205.area_sqft, color='blue')
plt.ylabel('square_ft')
plt.xlabel('price')
plt.show()
#creating train and test dataset
msk= np.random.rand(len(feat_data_11205)) < 0.7
train = feat_data_11205[msk]
test = feat_data_11205[~msk]
#Train Data Distribution
plt.scatter(train.price,train.area_sqft, color='blue')
plt.xlabel("price")
plt.ylabel('square_ft')
plt.show()
#Modeling
#using sklearn package to model data
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['price']])
train_y = np.asanyarray(train[['area_sqft']])
regr.fit (train_x, train_y)
# the coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
#plotting results
plt.scatter(train.price,train.area_sqft, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("price")
plt.ylabel('square_ft')
plt.show()
#Evaluation and Metrics
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[['price']])
test_y = np.asanyarray(test[['area_sqft']])
test_y_hat = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_hat - test_y)**2))
print("R2-score: %.2f" % r2_score(test_y, test_y_hat))  # r2_score expects y_true first
feat_data_11205.corr()['price'].sort_values()
plt.scatter(feat_data_11217.price, feat_data_11217.area_sqft, color='green')
plt.ylabel('square_ft')
plt.xlabel('price')
plt.show()
#creating train and test dataset
msk= np.random.rand(len(feat_data_11217)) < 0.7
train = feat_data_11217[msk]
test = feat_data_11217[~msk]
#Train Data Distribution
plt.scatter(train.price,train.area_sqft, color='green')
plt.xlabel("price")
plt.ylabel('square_ft')
plt.show()
#Modeling
#using sklearn package to model data
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['price']])
train_y = np.asanyarray(train[['area_sqft']])
regr.fit (train_x, train_y)
# the coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
#plotting results
plt.scatter(train.price,train.area_sqft, color='green')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("price")
plt.ylabel('square_ft')
plt.show()
#Evaluation and Metrics
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[['price']])
test_y = np.asanyarray(test[['area_sqft']])
test_y_hat = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_hat - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_hat - test_y)**2))
print("R2-score: %.2f" % r2_score(test_y, test_y_hat))  # r2_score expects y_true first
feat_data_11217.corr()['price'].sort_values()
feat_data_11217.groupby(['type_of_property']).mean()
feat_data_11217.groupby(['type_of_property']).mean()['price'].plot.bar(figsize=(12,7), color ='green')
feat_data_11205.groupby(['type_of_property']).mean()
feat_data_11205.describe()
feat_data_11205.groupby(['type_of_property']).mean()['price'].plot.bar(figsize=(12,7))
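The property-type comparison above can also be made on a price-per-square-foot basis, which makes the size-vs-price trade-off explicit. A minimal sketch with hypothetical rows that mimic the Zillow columns used in this notebook (`type_of_property`, `price`, `area_sqft`); the numbers are illustrative, not from the scraped data:

```python
import pandas as pd

# hypothetical rows mimicking the scraped Zillow columns
sample = pd.DataFrame({
    'type_of_property': ['House for sale', 'Townhouse for sale', 'Condo for sale'],
    'price': [2800000, 3200000, 1100000],
    'area_sqft': [3200, 3600, 900],
})
sample['price_per_sqft'] = sample['price'] / sample['area_sqft']
print(sample.groupby('type_of_property')['price_per_sqft'].mean().round(1))
```

On the real data, a high price per square foot combined with a small floor area would be one more signal that apartments are less suitable for a B&B conversion than houses or townhouses.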
## Looking at the information for 11217, we can see that "House for sale" & "Townhouse for sale" are the best options
## with regard to size and price; apartments are not suitable for this business.
## With that said, we filter out the needed information on properties for sale and locate them on the map.
brook_11217_zillow_filtered = brook_11217_zillow[(brook_11217_zillow.type_of_property == "Townhouse for sale") | (brook_11217_zillow.type_of_property == "House for sale")].reset_index(drop=True)
##### removing properties smaller than 2500 sqft and properties priced above 4 million
brook_11217_zillow_filtered = brook_11217_zillow_filtered[(brook_11217_zillow_filtered.area_sqft >= 2500) & (brook_11217_zillow_filtered.price <= 4000000)].reset_index(drop=True)
brook_11217_zillow_filtered
## add latitude and longitude columns
brook_11217_zillow_filtered["Lat"] = ""
brook_11217_zillow_filtered["Long"]= ""
## get the latitude and longitude of the addresses above
for index, frame in brook_11217_zillow_filtered['type_of_property'].items():
    address_z = brook_11217_zillow_filtered.address[index]
    latz, longz = getlatlong(address_z)
    print(address_z, (latz, longz))
    brook_11217_zillow_filtered.loc[index, 'Lat'] = latz
    brook_11217_zillow_filtered.loc[index, 'Long'] = longz
brook_11217_zillow_filtered
## create the final map: clusters, BnBs, and suggested properties
from folium import plugins
# create map
map_ALL = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
instances = plugins.MarkerCluster().add_to(map_ALL)
# add markers to the map
#markers_colors = []
#for lat, lon, poi, cluster in zip(brook_merged_short['Lat'], brook_merged_short['Long'], brook_merged_short['neighbourhood'], brook_merged_short['Cluster Labels']):
# label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
# folium.CircleMarker(
# [lat, lon],
# radius=25,
# popup=label,
# color=rainbow[cluster-1],
# fill=True,
# fill_color=rainbow[cluster-1],
# fill_opacity=0.7).add_to(instances)
##### Adding the Barclays Center and BnB locations
folium.Marker(
[latitude, longitude],
popup='Barclays Center',
icon=folium.Icon(color='black', icon='info-sign')
).add_to(map_ALL)
# add the Bed & Breakfast as blue circle markers
for lat, lng, label in zip(dataframe_filtered_bnb.lat, dataframe_filtered_bnb.lng, dataframe_filtered_bnb.name):
    folium.Marker(
        [lat, lng],
        popup=label,
        icon=folium.Icon(color='green', icon='info-sign')
    ).add_to(instances)
######
# add the properties suggested
for lat, lng, label in zip(brook_11217_zillow_filtered.Lat, brook_11217_zillow_filtered.Long, brook_11217_zillow_filtered.webpage):
    folium.Marker(
        [lat, lng],
        popup=label,
        icon=folium.Icon(color='blue', icon='cloud')
    ).add_to(map_ALL)
map_ALL
Brooklyn, one of the five boroughs of New York City, has risen to prominence thanks to investment in the real-estate sector and quick transport to Manhattan, becoming a trendy destination for visitors from all over the world. We took on the task of finding a suitable list of properties located within 1.6 km of the Barclays Center, surrounded by venues of interest and not crowded with B&Bs or hostels.
The initial analysis yielded a list of 11 B&Bs nearby, mostly concentrated in the Clinton Hill and Bed-Stuy neighborhoods. This gave room to look into neighborhoods adjacent to the Barclays Center in the opposite direction, closer to Manhattan as well as to sites of interest such as museums and parks.
Once the neighborhoods were identified, K-Means clustering was used to find similarities and differences between them. This helped solidify the idea of looking for properties in Cluster 0, zip code 11217, whose neighborhoods offer similar venues, such as bars, cafés, and international restaurants, to enrich the visitor experience.
The web scraping on Zillow yielded a list of 100+ for-sale properties in each of the two postal codes under analysis: 11205 (Clinton Hill) and 11217 (Gowanus, Boerum Hill, Park Slope). The data needed some cleaning, removing outliers with a wrong square-foot area or missing information. The linear regression on the remaining data supports its plausibility, since price increases with square-foot area once outliers and missing data are cleared out, which gives good confidence for business decision-making.
Ultra-expensive and very small properties were excluded from the recommendation, since they are not suitable for the proposed business. The resulting properties share similar features such as area, beds, and price.
With all the above completed, the result is a recommendation of 9 properties that would be a good fit for a B&B business close to the Barclays Center; the pop-up on the map links to the real-estate page for each property.
Brooklyn is a vibrant place with many venues and a multicultural heritage, yet there still seems to be room for a Bed & Breakfast business, providing an affordable hospitality option for younger, budget-minded travelers who spend most of the day visiting the city rather than enjoying the hotel/B&B premises.
Using different data-analysis techniques, it was possible to create a reasonable business proposal and a recommendation for the problem at hand.
K-Means and regression are tools that help tell a story and validate the data, providing an accurate picture of the environment under study.
Tools like these can help people make the best decisions for their business.
The techniques, tools, and methods learned in the Coursera-IBM Specialization helped generate data and a story for a possible real-life scenario. This exercise could be extended and improved, which is part of the journey of becoming a Data Scientist. Very good course: I started from zero and was able to understand many concepts that are useful in my day-to-day work.